training and test data
GOOD: A Graph Out-of-Distribution Benchmark
Out-of-distribution (OOD) learning deals with scenarios in which training and test data follow different distributions. Although general OOD problems have been intensively studied in machine learning, graph OOD is only an emerging area of research. There is currently no systematic benchmark tailored to evaluating graph OOD methods. In this work, we aim to develop an OOD benchmark, known as GOOD, specifically for graphs. We explicitly distinguish between covariate and concept shifts and design data splits that accurately reflect each kind of shift. We consider both graph and node prediction tasks, as designing shifts differs in key ways between the two. Overall, GOOD contains 11 datasets with 17 domain selections. When combined with covariate, concept, and no shifts, we obtain 51 different splits. We provide performance results on 10 commonly used baseline methods with 10 random runs.
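The distinction between covariate and concept shift can be made concrete by factoring the joint distribution of inputs $X$ and labels $Y$. The block below is a sketch of the common textbook formulation, not a quotation from the GOOD paper; the superscripts "train" and "test" are notation assumed here for illustration.

```latex
\begin{align*}
\text{factorization:}\quad & P(X, Y) = P(Y \mid X)\, P(X) \\
\text{covariate shift:}\quad & P^{\text{train}}(X) \neq P^{\text{test}}(X),
  \quad P^{\text{train}}(Y \mid X) = P^{\text{test}}(Y \mid X) \\
\text{concept shift:}\quad & P^{\text{train}}(Y \mid X) \neq P^{\text{test}}(Y \mid X),
  \quad P^{\text{train}}(X) = P^{\text{test}}(X)
\end{align*}
```

Under this reading, a "no shift" split is one where both factors agree between training and test data.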
Reviewer 1: Thank you for the insightful analysis and acknowledgement of our effort
Reviewer 1: Thank you for the insightful analysis and acknowledgement of our effort. The model is clearly expressive enough, as training and test accuracy are near-perfect. We did not intend to give the impression that over-fitting is new or surprising for these datasets. In Sec 2.2 and Theorem 2.1, we rigorously showed the existence of a perfect accuracy For example, compare P@k and PSP@k of PfastreXML and FastXML in Table 3 and Table 4. We note that near-orthogonality is condition No. 5 mentioned in Theorem 2.1 for the existence of a perfect XMC repository and we have verified its correctness and we will release it in the final version. We are happy to provide more details if this is a point of concern.
Machine Learning Models for Accurately Predicting Properties of CsPbCl3 Perovskite Quantum Dots
Çadırcı, Mehmet Sıddık, Çadırcı, Musa
Perovskite Quantum Dots (PQDs) have a promising future for several applications due to their unique properties. This study investigates the effectiveness of Machine Learning (ML) in predicting the size, absorbance (1S abs) and photoluminescence (PL) properties of $\mathrm{CsPbCl}_3$ PQDs using synthesis features as the input dataset. The study employed the ML models Support Vector Regression (SVR), Nearest Neighbour Distance (NND), Random Forest (RF), Gradient Boosting Machine (GBM), Decision Tree (DT) and Deep Learning (DL). Although all models produced highly accurate results, SVR and NND gave the most accurate property predictions, achieving excellent performance on the test and training datasets, with high $\mathrm{R}^2$ and low Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) values. As ML becomes more capable, its ability to model the QD field could prove invaluable in shaping the future of nanomaterials design.
Fairness Hub Technical Briefs: Definition and Detection of Distribution Shift
Acevedo, Nicolas, Cortez, Carmen, Brooks, Chris, Kizilcec, Rene, Yu, Renzhe
Distribution shift is a common situation in machine learning tasks, where the data used for training a model is different from the data the model is applied to in the real world. This issue arises across multiple technical settings: from standard prediction tasks, to time-series forecasting, to more recent applications of large language models (LLMs). This mismatch can reduce performance, and can stem from many factors: sampling issues and non-representative data, changes in the environment or policies, or the emergence of previously unseen scenarios. This brief focuses on the definition and detection of distribution shifts in educational settings. We focus on standard prediction problems, where the task is to learn a model that takes in a series of inputs (predictors) $X=(x_1,x_2,...,x_m)$ and produces an output $Y=f(X)$.
$L_0$ Regularization of Field-Aware Factorization Machine through Ising Model
We examined the use of the Ising model as an $L_0$ regularization method for field-aware factorization machines (FFM). This approach improves generalization performance and has the advantage of simultaneously determining the best feature combinations for each of several groups. We can deepen the interpretation and understanding of the model from the similarities and differences in the features selected in each group.
Prediction approaches for partly missing multi-omics covariate data: A literature review and an empirical comparison study
Hornung, Roman, Ludwigs, Frederik, Hagenberg, Jonas, Boulesteix, Anne-Laure
The generation of various types of omics data is becoming increasingly rapid and cost-effective. As a consequence, there are more so-called multi-omics data becoming available, that is, high-dimensional molecular data of several types such as genomic, transcriptomic, or proteomic data measured for the same patients. In the last few years, several approaches to use these data for patient outcome prediction have been developed (see Hornung and Wright (2019) for an extensive literature review). Nevertheless, doubts have recently emerged as to whether there is benefit to using multi-omics data over simple clinical models (Herrmann et al., 2020). Regardless of their usefulness for prediction, multi-omics data from different sources that are used for the same prediction problem, for various reasons, often do not feature the exact same types of data. Most importantly, the data for which predictions should be obtained, that is, the test data, often do not contain the same data types as the data available for obtaining the prediction rule, that is, the training data (Krautenbacher et al., 2019). The training data is also frequently composed of subsets originating from different sources (e.g.
MLOps -- Understanding Data Drift. Types of Data Drifts and Monitoring…
One of the important functions of MLOps engineers is to monitor the model performance. Data drift causes degradation in the model performance over a period of time. Let's discuss data drift and the steps we can take to detect it in detail. Data drift refers to changes in the data distribution over a period of time. Data drift can lead to poor model performance, because the model is being applied to data that is different from the data it was trained on.
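One common way to check a single numeric feature for drift is to compare a reference sample (for example, training data) against a recent production sample with a two-sample Kolmogorov-Smirnov statistic. The snippet below is a minimal sketch of that idea using only the Python standard library; it is not taken from the article. The function names `ks_statistic` and `has_drifted` and the default `threshold` of 0.2 are hypothetical choices for illustration.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical CDFs of the two samples (0.0 = identical)."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so evaluating the gap at every observed value is sufficient.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def has_drifted(reference, current, threshold=0.2):
    """Flag drift when the KS statistic exceeds a threshold.
    The threshold is an assumed value; tune it per feature."""
    return ks_statistic(reference, current) > threshold


if __name__ == "__main__":
    training_feature = [1.0, 2.0, 3.0, 4.0]
    production_feature = [3.0, 4.0, 5.0, 6.0]
    print(ks_statistic(training_feature, production_feature))
    print(has_drifted(training_feature, production_feature))
```

A fixed threshold keeps the sketch simple. In practice a significance test on the statistic, which depends on both sample sizes, is often preferred over a hard-coded cut-off.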